ABSTRACT

Results of a comparative study of F and Q tests, in a randomized block design
with one replication per cell, are presented. In addition to these two
procedures, a multivariate test was also considered. The model and test
statistics, data generation and parameter selection, results, summary and
conclusions are presented. Ten tables contain the study data. (DB)

A Comparison of Three Methods of Analyzing Dichotomous
Data in a Randomized Block Design
Garrett K. Mandeville, University of South Carolina
Introduction and Review of Literature

Many situations arise in behavioral research where the data fit nicely into the
randomized block paradigm. For example, when information is available on an antecedent 
variable such as aptitude, subjects may be "blocked" on the basis of these scores to 
gain precision in comparing effects of the independent variable, e.g., teaching methods, 
on the dependent variable, e.g., achievement. Often this randomized block design would 
be an improvement over the completely randomized design. When a single subject responds 
to a series of trials or to varying treatment conditions given in random order the term 
repeated measures design is commonly used in the behavioral literature. In each case 
the experimenter wishes to compare the strength of response for the treatments, trials, 
etc., or test or estimate some contrast of their effects. The methods of analysis are 
the same, however, and this is why the terminology randomized block paradigm was used 
above. 

When the response variable is continuous, analysis of variance (ANOVA) or a 
multivariate procedure would probably be used to analyze the data. A recent study by 
Porter and McSweeney (1970) has considered the advantages of blocking in situations 
where a non-parametric analysis is to be performed. It is this writer's contention,
however, that many situations occur in behavioral research where the measurement is
of such quality as to invalidate any of these techniques. Of particular interest in
this study are situations where the dependent variable is dichotomous. A few examples
are learning trials in which the subject makes either the correct or incorrect
associations, maze running in which a rat turns right or left, problem solving where the
problem is solved or not solved, and attitude surveys in which the respondent agrees or
disagrees. Sometimes the crude dichotomous response is obtained to facilitate speed.
Numerous examples of dichotomous dependent variables in psychological experiments are
given in Seegar and Gabrielson (1968, p. 270). The question of how to analyze data
in a randomized block design when the dependent variable is dichotomous is not
discussed with any frequency in the applied statistical literature.

Cochran (1950) presented the Q test as a procedure for testing the hypothesis
that, in each block, the response probabilities are constant for the various treatment
levels. (Note that the terms blocks and treatments will be used for convenience here;
the reader should understand that blocks may represent a set of subjects grouped
together or multiple measurements on the same subject, etc., and that treatments, which
might be more appropriately called treatment levels, are the conditions for which
comparisons of effects are desired. Also, for definiteness, let us assume that there
are I blocks and J treatments.) The Q statistic is based on a randomization argument,
i.e., if the treatments are no different, the $u_i$ positive responses in block i could
have, with equal probability, been arranged in any of the $\binom{J}{u_i}$ ways. Under
randomization, the responses have the same variances and, because of symmetry, the covariances
between any response pairs are also equal. The sum of squared deviations among the
J treatment totals is, for large samples, shown to approximate a multiple of the $\chi^2$
distribution.

Let us clarify the hypothesis tested with Q. Let $\phi_{ij}$ represent the "success"
probability associated with the application of the jth treatment in the ith block.
Then the null hypothesis of the Q test is $H_{01}: \phi_{i1} = \phi_{i2} = \dots = \phi_{iJ}$
for $i = 1, 2, \dots, I$. Note that the hypothesis of ANOVA is of lesser magnitude, i.e.,
$H_{02}: \phi_{\cdot 1} = \phi_{\cdot 2} = \dots = \phi_{\cdot J}$,
where $\phi_{\cdot j}$ is the average success probability for the jth treatment. It should be clear
that $H_{01}$ is a composite of $H_{02}$ and the hypothesis of no treatment x block interaction.
Seegar and Gabrielson (1968) consider data sets with varying amounts of treatment x
block interaction and observe that, as one would expect, the Q test does have power
to detect situations where interaction exists but $H_{02}$ is true. It is probably true,
partly because of the unfortunate way Q is handled in secondary sources, that
researchers do not realize the subtle differences in these hypotheses. Tukey's test for
non-additivity would provide one method for detecting an interaction of the multiplicative
type, but would probably be of limited usefulness in most applications in the behavioral 
sciences where blocks are random. Other suggestions are given in Draper and Porter 
(1970). 
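
The relation between the two hypotheses can be sketched with a simple decomposition of the cell probabilities. The sketch below treats the I blocks as fixed for the moment (an assumption made only for illustration), writes $\phi_{ij}$ for the success probability of treatment j in block i, and introduces the symbols $\tau_j$ and $\gamma_{ij}$, which do not appear in the original.

```latex
\begin{align*}
\phi_{ij} &= \phi_{i\cdot} + \tau_j + \gamma_{ij}, \qquad
  \tau_j = \phi_{\cdot j} - \phi_{\cdot\cdot}, \qquad
  \gamma_{ij} = \phi_{ij} - \phi_{i\cdot} - \phi_{\cdot j} + \phi_{\cdot\cdot} \\
H_{01}:\ & \phi_{ij} = \phi_{i\cdot} \text{ for all } i, j
  \iff \tau_j + \gamma_{ij} = 0 \text{ for all } i, j \\
& \text{averaging over } i \text{ and using } \textstyle\sum_i \gamma_{ij} = 0
  \ \Rightarrow\ \tau_j = 0 \ \Rightarrow\ \gamma_{ij} = 0 \\
H_{02}:\ & \tau_j = 0 \text{ for all } j
\end{align*}
```

Under these assumptions the first hypothesis holds exactly when the second holds and every interaction term $\gamma_{ij}$ vanishes, which is why a test of the first can reject when only interaction is present.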

When Cochran introduced the Q test, he presented some results comparing probabilities
obtained using the $\chi^2$ distribution with the exact probabilities obtained using
randomization. The problems he studied were kept small to facilitate computation, the
number of treatments ranging from 3 to 5 with from 6 to 16 blocks. In addition to
computing Q for these data sets, the F ratio for a randomized blocks ANOVA was also
computed. Cochran also computed Q' and F', which were values of these statistics
corrected for continuity; they were obtained by finding the next smaller value of the
statistic and averaging the two. For each of these statistics, then, the appropriate
table was entered and the probability of a larger value was obtained. These were then
converted to a percentage error by computing

100 (Tabular P - True P)/(True P).

Cochran stated that F' was decidedly better than F and so he did not present results
for F. Of the other three statistics, Q was suggested to be preferred. It exhibited
a negative bias, i.e., a tendency to underestimate the true probability, when true P
was in the range of .2 to .02, and a tendency to overestimate the true probability
(corresponding to a conservative test) for true probabilities below .02. The corrected
$\chi^2$ had a positive bias over the whole range of true probabilities while F' had a
tendency towards a negative bias.

Although it may strike the reader as being strange that Cochran would even
consider the F test in this situation, the following quote is illuminating: "I had once
or twice suggested to research workers that the F test might serve as an approximation
even when the table consists of 1's and 0's ... this suggestion was received with
incredulity, the objection being made that the F test requires normality, and that a
mixture of 1's and 0's could not by any stretch of the imagination be regarded as
normally distributed. The same workers raised no objections to a $\chi^2$ test, not having
realized that both tests require, to some extent, an assumption of normality, and that
it is not obvious whether F or $\chi^2$ is more sensitive to this assumption" (1950, p. 262).
Cochran further justifies consideration of F because of the widespread interest in
the application of analysis of variance to non-normal data and, although his results
are discouraging with regard to the F test, he notes that the application of the F test
to more complex tables should be kept in mind.

One important point which Cochran makes, and which has been overlooked by most
secondary sources, is that Q is invariant under deletion of what will be termed here
"non-informative" blocks, i.e., blocks with responses of all 0's or 1's. Therefore,
the term "large samples" must be used with caution, and many statements in textbooks
such as McNemar (1962) and Siegel (1956) are misleading in this regard.

A more recent study of the sampling distribution of Q for small samples was done 
by Tate & Brown (1964, 1970). These writers clarify the fact that Q is not changed 
upon deletion of non-informative blocks but point out that the F test is changed by 
this deletion. This is because the degrees of freedom of the residual mean square 
will change with the number of blocks in the experiment, thus affecting both its value
and the reference F distribution. Using data from Fleiss (1965), Tate and Brown compare
results using F and Q and the exact probabilities from randomization. When no
rows are deleted, the percent errors of F and Q are similar for the full table of the
Fleiss data, but when non-informative rows are deleted, this one set of data suggests
that the F test has a negative bias. Tate and Brown go on to tabulate the exact
distribution of Q for designs with 3 treatments and from 3 to 12 blocks; 4 treatments
and from 2 to 8 blocks; 5 and 6 treatments and from 2 to 5 blocks. The tables are
presented in Tate and Brown (1964). By comparing Q to the exact probabilities for
these tables these writers observe a negative bias in Q, but suggest that, when the 
product of the number of (informative) blocks and treatments is 24 or more, the 
approximation using Q is probably sufficient for most practical work. This is based 
on the observation that, for the distributions they considered for which this was true, 
median percent errors were in the range of 12% to 20%. For situations in which the 
product is less than 24, the exact tables of Tate and Brown may be used. Tate and 
Brown give no comparison of F and Q for these distributions. 

The other rather extensive study reported in the literature concerning Q is the
one already mentioned by Seegar and Gabrielson. These researchers were interested in
extending the Q test to a situation in which there was more than one measurement
available for each treatment-block combination. An extension of Q to this situation was
presented and it was compared to the ANOVA F test using the treatment x block mean
square as the error term. Although they did not consider it, a point in favor of the
F test in this design is that it allows both treatment effects and treatment x block
interactions to be tested, whereas the extended version of the Q test which they
suggested tested the same hypothesis as Cochran's original test. The results of this study
suggest that where $H_{01}$ is true, Q (or the modification) can be used when the product
of the number of blocks and the number of replications per treatment is in the range
10 - 20. The above test, however, requires that treatment x block interactions be
null, for otherwise $H_{01}$ is false and the Q test will have power to detect this. In
these situations the F test, to test $H_{02}$, will still provide a reliable answer for
from 5 - 10 subjects. Seegar and Gabrielson point out that when $H_{01}$ (and therefore
$H_{02}$) is true, the F test is as good as Q. The arcsin transformation was found to
provide results which were very similar to those of the F test.

Although Cochran's results with F (and F') suggest that it leads to underestimates 
of the true probabilities, it needs to be said that Seegar and Gabrielson did not find 
this to be true in their study. There is a subtle difference in the methods used by
Seegar and Gabrielson and by this writer on the one hand and by Cochran and Tate and
Brown on the other. The method used here was to define critical regions using the
reference F or $\chi^2$ distributions and to tally the instances of a significant value for
the statistic. Thus, what becomes important is how the empirical (discrete) and
theoretical (F or $\chi^2$) cumulative distributions compare at the selected points. In the
study to be reported here, the 90th, 95th, 97.5th and 99th percentiles of the F and $\chi^2$
distributions were used to designate lower boundaries of critical regions for $\alpha$ = .10,
.05, .025 and .01. These were selected because they cover common significance levels
employed by educational researchers. 

This writer decided to take a closer look at the properties of the F test in a 
randomized block design with one replication per cell. This comparative study of the 
F and Q tests seemed in order since the only comparisons available in the studies cited 
above were limited to the few examples of Cochran and Tate and Brown and the few 
simulations for one replication of Seegar and Gabrielson. The research is restricted
to one replicate because it was anticipated that researchers with more than one
observation per cell would use an F test to allow for a test of interaction.

Some writers have suggested that recent results of Hsu and Feldt (1970) and
Lunney (1970) for ANOVA with dichotomous data in independent cells designs can be
applied to ANOVA in designs where the data are correlated. Because of the differences
in fixed and mixed ANOVA models, this writer suggests that a certain amount of caution
be exercised here. It is the feeling of this writer that the F test should be the
recommended procedure if it simply provides results which are as good as Q because
(1) it is more flexible than Q and, therefore, more promising for complex designs,
(2) it is so frequently used, and therefore probably fairly well understood by
educational researchers, and (3) many computer programs are available to facilitate the
analysis. For researchers who are mainly concerned with differences in average
success proportions, some form of F test should be recommended over Q because of the
power of the Q test if treatment x block interactions exist.
The Model and Test Statistics

A common specification of the model is $Y_{ij} = \mu + a_i + \beta_j + \epsilon_{ij}$, where the $a_i$ are
block effects and are randomly sampled from a normal population with mean zero and
variance $\sigma_a^2$, the $\beta_j$ represent fixed treatment effects ($\sum_j \beta_j = 0$), and residual variation
is included in the $\epsilon_{ij}$. The $\epsilon_{ij}$ are from a normal population with mean zero and
variance $\sigma_\epsilon^2$, and the $a_i$ and the $\epsilon_{ij}$ are independent (that is, within and between sets).
This model leads to correlations between measurements within a block which are the
same and this, therefore, becomes an assumption in the "univariate" ANOVA solution of
the problem.

Without going into the detail of reproducing computational formulas, which are
well known to most, some comparisons will be drawn between the Q and F statistics.
In order to compute the Q and the F statistics, the following two sums of squares were
obtained: $SS_T$, the sum of squares due to treatments, and $SS_E$, the error sum of squares.
Then formulas for Q and F are $Q = I(J-1)\,SS_T/[SS_T + SS_E]$ and
$F = MS_T/MS_E = [SS_T/(J-1)]/[SS_E/((I-1)(J-1))]$. The null distribution of the Q
statistic approaches $\chi^2_{J-1}$ as I increases (this assumes that some "informative"
vectors have a positive probability), and the F statistic was compared to the F
distribution with J-1 and (I-1)(J-1) degrees of freedom. Manipulations carried out to
allow comparison of F and Q yield

$$\frac{Q}{J-1} = \frac{MS_T}{[(I-1)\,MS_E + MS_T]/I}.$$

This statistic, of course, can be referred to the $F_{J-1,\infty}$ distribution. Paralleling
a discussion of D'Agostino (1971) we observe that the denominators of F and Q/(J-1)
differ only slightly, and we will expect that, for large numbers of blocks, test size
results for the two methods will be very similar. However, it appears as though,
when treatment effects are non-null, the denominator of Q/(J-1) will be inflated and
the power of Q, relative to F, will be lowered.
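
The relation between Q and F described above can be checked numerically. The following is a minimal sketch, not the program used in the study: the function name `q_and_f` and the small example table are illustrative assumptions, and the sums of squares are the usual randomized block ANOVA quantities for an I x J table of 0/1 responses.

```python
def q_and_f(table):
    """Return (Q, F, MS_T, MS_E) for an I x J table of 0/1 responses.

    Rows are blocks, columns are treatments. The function raises
    ZeroDivisionError when every block is non-informative or when the
    error sum of squares is zero; that case is not handled in this sketch.
    """
    I, J = len(table), len(table[0])
    block_totals = [sum(row) for row in table]
    treat_totals = [sum(row[j] for row in table) for j in range(J)]
    grand = sum(block_totals)
    correction = grand ** 2 / (I * J)
    ss_total = grand - correction            # y**2 == y for 0/1 data
    ss_blocks = sum(u * u for u in block_totals) / J - correction
    ss_t = sum(t * t for t in treat_totals) / I - correction
    ss_e = ss_total - ss_blocks - ss_t
    q = I * (J - 1) * ss_t / (ss_t + ss_e)   # Q written in terms of SS_T and SS_E
    ms_t = ss_t / (J - 1)
    ms_e = ss_e / ((I - 1) * (J - 1))
    return q, ms_t / ms_e, ms_t, ms_e


# Hypothetical 6-block, 3-treatment table (illustration only).
example = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
]
q, f, ms_t, ms_e = q_and_f(example)
I, J = len(example), len(example[0])
# Q/(J-1) and F share the numerator MS_T; only the denominators differ.
q_over_df = ms_t / (((I - 1) * ms_e + ms_t) / I)
```

For this table the two denominators are $MS_E$ for F and $[(I-1)MS_E + MS_T]/I$ for Q/(J-1), so a large treatment mean square inflates the latter, in line with the remark about power.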

A more general model than the one above is to consider the observations in a
block, the $Y_{i1}, Y_{i2}, \dots, Y_{iJ}$, to be a vector observation from a J-variate normal
distribution with mean vector $\mu_y$ and general covariance matrix $\Sigma_y$. The hypothesis
of interest, in this more general case, is that the elements of $\mu_y$ are equal, i.e.,
that $\mu_{y1} = \mu_{y2} = \dots = \mu_{yJ}$. The problem is solved by transforming the Y's to a
space of J-1 contrasts (differences are the most straightforward) using a transformation
matrix C which has J-1 independent contrasts as rows. Then X = CY and the
hypothesis above becomes $\mu_x = 0$, which can be tested using Hotelling's $T^2$ statistic,
$T^2 = I\,\bar{x}' S^{-1} \bar{x}$. Actually a simpler computational form is given by Rao (1965) as
$T^2 = I(I-1)\,[\,|A + \bar{x}\bar{x}'|/|A| - 1\,]$, where A is the sum of products matrix for the
transformed variables. The multivariate statistic used here was
$[I-J+1]\,T^2/[(I-1)(J-1)]$, which is distributed as an F with J-1 and I-J+1 degrees of
freedom if the hypothesis is true. The reader is reminded that I in the above formulas
is the number of blocks, not the identity matrix.
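
The agreement of the determinant form of Hotelling's statistic with the quadratic form can be sketched in a few lines. This assumes that A is the corrected sum of products matrix of the transformed variables, so that $S = A/(I-1)$, and that the determinant form reads $T^2 = I(I-1)[\,|A + \bar{x}\bar{x}'|/|A| - 1\,]$; the first line is the matrix determinant lemma.

```latex
\begin{align*}
|A + \bar{x}\bar{x}'| &= |A|\,\bigl(1 + \bar{x}' A^{-1} \bar{x}\bigr) \\
\frac{|A + \bar{x}\bar{x}'|}{|A|} - 1 &= \bar{x}' A^{-1} \bar{x} \\
I(I-1)\left[\frac{|A + \bar{x}\bar{x}'|}{|A|} - 1\right]
  &= I\,\bar{x}' \left(\frac{A}{I-1}\right)^{-1} \bar{x}
   = I\,\bar{x}' S^{-1} \bar{x} = T^2
\end{align*}
```

The determinant form avoids an explicit matrix inverse, which is presumably why it is described as computationally simpler.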



Data Generation and Parameter Selection

The generation of a dichotomous response vector requires a researcher to specify
a model which allows various degrees of dependency to be manifest in the responses. If
this study were dealing with continuous variables, the multivariate normal distribution
would probably be the model used because of the many statistical procedures which
assume that the data are sampled from it. In this study a multivariate normal
distribution was assumed to be latent in the data, i.e., the assumption was that underlying
each response was a normally distributed continuous random variable. This continuous
variable was then compared to a fixed "cutting score" and, if the response surpassed
it, a one was recorded; otherwise the response was taken to be a zero. The cutting
score was determined by the success proportion associated with the particular
measurement involved.

The problem of selecting the multivariate normal distributions to be used for
the generation of the latent variables amounted to selecting correlation matrices,
because the means and variances for the dichotomous variables are determined by the
success proportions used. When only two measurements are made in each block, only
one correlation needs to be specified and the four values taken for this parameter
were .0, .2, .5 and .8. The use of the zero correlation, or no association between
the two measurements, will provide information on the extent to which applying an
analytic technique for correlated data will penalize the researcher when in fact
the data are unrelated. The use of a maximum correlation of .8 was justified because
it is unusual for larger correlations to occur in practice in educational research.

For more than two measurements, however, more than one measure of association
needs to be selected. For three measurements, for example, there are three pairwise
correlations which need to be specified. For the major portion of the study,
these pairwise correlations were all taken to be equal, and again the values of .0,
.2, .5, and .8 were used. What this means is that, for the latent variables for most
of the results presented here, the ANOVA assumption of equal correlations among the
variables was satisfied. For the portion of the investigation dealing with type I
error, i.e., when the success proportions for the treatments were equal, the population
covariance matrices for the dichotomous variables also satisfied this assumption.
However, when power was investigated, i.e., the success proportions were not the
same for all treatments, the covariance matrix does not satisfy the pattern assumption
of equal variances and covariances.

The number of treatments in a block was taken to be 2, 3, 6 and 10. Again, it
was thought that this would provide a range for this parameter which would include
most practical cases in educational experimentation. An exception would be a situation
where each response was an item response in a test. In this case, of course, tests
with more than ten items would be commonplace. However, it is not generally of
interest to a researcher to determine whether the items in a test are of the same
difficulty. The number of blocks was taken to be 5, 10, 20, and 30. It was anticipated
that whatever large sample effects were to be observed would be manifest
for samples of size 30.

For the investigation of type I error, the null hypothesis is true, and the
success proportions for all treatments are the same. To span the range of success
proportions from 0 to 1 is unnecessary since, by redefining success and failure,
results for the range 0 to .5 can be applied to the range .5 to 1. The values of $\phi$,
the constant success proportion, which were used in this study were .1, .3 and .5.
For success proportions smaller than .1, it is anticipated that, unless sample sizes
are extremely large, a researcher should probably consider some alternative method
of analysis.

Taking all combinations of the four values of J, the number of treatments (2, 3,
6, and 10), and I, the number of blocks (5, 10, 20, and 30), sixteen different designs
were initially to be considered. Due to limitations on computer time, the 10 treatment
by 30 block design was eliminated. For each of the remaining fifteen designs,
simulations were run for each level of $\rho$, the correlation among the measurements in a
block (.0, .2, .5 and .8), in combination with each value of $\phi$, the constant success
proportion (.1, .3, and .5). Therefore, twelve simulations occurred for each design.
In addition, for 3 and 6 treatments, a non-patterned correlation matrix was constructed
and this was used in conjunction with each of the three $\phi$ values. These runs
were limited to designs with 10 and 20 blocks.

To describe how the simulation took place, the following listing of the steps
involved in the determination of empirical test size is presented:

1. Values of I, the number of blocks, and J, the number of treatments, were
set.

2. A correlation matrix R, either constant with intercorrelations of .0, .2,
.5, or .8 or a non-patterned matrix in a few special cases, was specified.

3. The vector of success proportions with all elements equal to $\phi$ = .1, .3
or .5 was selected.

4. A sample vector was generated from a multivariate normal distribution with
covariance matrix R.

5. Each response was converted to a one if it surpassed the 100(1-$\phi$)th percentile
of the standard normal distribution. Thus, for $\phi$ = .3, any latent variables larger
than the 70th percentile of the standard normal distribution, i.e., larger
than .52, were converted to ones; otherwise a zero was recorded. In this
way, each continuous response vector was converted to a vector of 0's and
1's.

6. The quantities necessary for computation of the three test statistics were
accumulated.

7. Steps 4-6 were applied for data for each of the I blocks.

8. The values of the three statistics, Q, F, and M, were obtained.

9. The boundaries for frequency distributions for each of the three statistics
were computed. Of importance here are the 90th, 95th, 97.5th and 99th
percentiles which were obtained for each of the three reference distributions.
This step in the program was bypassed after the first data set was generated.

10. The computed statistics were cast into the frequency distributions set
up in step 9.

11. Steps 4-10 were performed 1000 times.

The resulting empirical proportions above each of these percentiles are to be taken
as estimates of true type I error (test size) for the test procedure.
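
The steps above can be sketched in a few lines for the Q test alone. This is a minimal illustration under stated assumptions, not the original program: it fixes J = 3 so that the $\chi^2$ critical value with 2 degrees of freedom has the closed form $-2\ln\alpha$, it produces equally correlated latent normals from a single common factor (one convenient construction for a constant correlation matrix with $0 \le \rho \le 1$), and it counts data sets with no informative blocks as non-rejections. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cochran_q(table):
    """Cochran's Q for an I x J 0/1 table; None if no block is informative."""
    J = len(table[0])
    treat = [sum(row[j] for row in table) for j in range(J)]
    blocks = [sum(row) for row in table]
    grand = sum(blocks)
    denom = J * grand - sum(u * u for u in blocks)
    if denom == 0:
        return None
    return (J - 1) * (J * sum(t * t for t in treat) - grand ** 2) / denom


def simulate_block(J, rho, cut, rng):
    """One block: equicorrelated latent normals dichotomized at the cutting score."""
    common = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return [1 if a * common + b * rng.gauss(0.0, 1.0) > cut else 0 for _ in range(J)]


def empirical_size(I, phi, rho, alpha=0.05, reps=1000, seed=0):
    """Proportion of simulated data sets in which Q exceeds the chi-square cutoff."""
    J = 3                                   # assumption: fixed so that df = 2
    rng = random.Random(seed)
    cut = NormalDist().inv_cdf(1.0 - phi)   # about .52 when phi = .3
    crit = -2.0 * math.log(alpha)           # upper-alpha point of chi-square, df = 2
    hits = 0
    for _ in range(reps):
        table = [simulate_block(J, rho, cut, rng) for _ in range(I)]
        q = cochran_q(table)
        if q is not None and q > crit:
            hits += 1
    return hits / reps


size_estimate = empirical_size(I=20, phi=0.5, rho=0.2)
```

Keeping the number of replications in the denominator even when Q is undefined mirrors the choice of retaining 1000 as the base; other treatments of such data sets would give slightly different estimates.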

For consideration of power the only alteration in the procedure was that, in
step 3, a non-null proportion vector was read into the computer which would, under
certain normal theory considerations, yield power of .60 and .80 for type I error
of .05 or .01. Thus, the four combinations of $\alpha$ and $1-\beta$ required four simulations
for a given design and R matrix.

Before discussing the results, a word about how non-informative blocks were
handled in this investigation is in order. From a logical point of view, these data
vectors support the null hypothesis since, within them, no differences between the
treatments are manifest. If the measurement scale had not been so crude, these scores
would probably have been "closer together" than scores for blocks which exhibited
variation. The retention of these vectors, which causes the estimated probabilities
to increase and, therefore, the statistical tests to become more conservative, is one
way to take account of such data. The retention of degrees of freedom in the
denominator of F for these vectors is along the lines of Cochran's suggestion that, since
F as he computed it tended to be liberal and $\chi^2$, with essentially infinite degrees
of freedom in the denominator, tended to be conservative, possibly some artificial
number of these "non-informative" vectors would yield an empirical distribution which
provided a better fit to its theoretical counterpart. This researcher was not
contriving to find out whether an appropriate mixture of these non-informative vectors
could be identified but rather whether, when sampling over a wide range of parameters, this
effect that Cochran suggested does actually tend to produce better results using the
F test rather than the Q test. Along these same lines Meyer (1967) has shown that the
unconditional size of McNemar's test (Q test for J = 2) is less than nominal $\alpha$ unless
"non-informative" vectors have probability zero. All of these considerations led
this investigator to retain the ordinary degrees of freedom in the denominator of the
F test used.
Results

In Table 1 the reader will find summary results of the main portion of the
investigation dealing with test size. The values in the table designated by AVE are
averages of the proportions of times that the computed statistic exceeded the
corresponding percentile of the appropriate reference distribution. The averages were taken
over the 12 runs coming from the combinations of the four levels of interdependency
among the observations in a block ($\rho$) and the three levels of success proportions ($\phi$).
The rows of this table denoted PCT are the average relative (percent) errors. Again
each average is based on 12 relative errors and, for a given run, the relative error
is $(p - \alpha)/\alpha$, where p is the empirical type I error and $\alpha$ is the nominal type I error.

The quantity F is the average of these relative errors for these four selected upper 
percentiles. 

Looking at these average relative errors, an immediate observation is that, using
this measure of fit of the empirical and theoretical distributions, the F test has a
smaller average error than either Q or M for each design considered. It is also true that
Q outperforms M on this basis. Comparing Q and F for a moment, we observe that the
advantages of F over Q are largest for designs where the number of treatments is
large relative to the number of blocks. For example, the largest difference in the
average relative errors is for the 10 treatment by 5 block design, where the average is
60% for the Q test and 29% for the F test.

Looking at the relative error as a function of nominal type I error we observe,
as might be expected, that the errors increase as we go further out in the tail of
the distribution, i.e., the fit of the empirical and theoretical distributions is
poorer for $\alpha$ = .01 than .05. This trend is more noticeable for Q than for F and, to
a large extent, this accounts for the smaller average relative error of the F test for
designs with two or three treatments. As a matter of fact, for two and three treatment
designs, AVE and PCT values for the two tests are almost identical for $\alpha$ = .10 and .05.
The finding of Cochran that Q has
a positive bias above the 98th percentile is substantiated; the largest value of
AVE for $\alpha$ = .01 is .008 for the 10 x 20 design and most of the other values are
substantially smaller. Although the corresponding value of AVE for the F test only
achieves .01 for the 10 x 20 design, the majority of values for designs of modest
size were as large as .008. Finally, the expected result that the fit of the empirical
and theoretical distributions improves as the number of blocks increases is observed.

This writer somewhat arbitrarily selected the value of 20% relative error as a
value which may be reasonably allowed in most educational experimentation. For tests
at the 5% level this would correspond to average test size between .04 and .06.
Considering 5% level F tests, the designs which satisfied this criterion led to the
simple rule of those with 60 or more total observations, i.e., 2 x 30, 3 x 20, 3 x 30,
6 x 10, etc. As pointed out above, for $\alpha$ = .05, F and Q do not differ for two and
three treatments. It is also true that for six and ten treatments, the differences
are minor so that, by bending the rule to allow PCT values of up to 22%, all of these
designs satisfy the criterion for the Q test. As has been noted earlier, the F test
tends to provide a better fit than Q in the upper tail of the distribution. However,
for $\alpha$ = .01, no simulations produced PCT values of less than 20% for either F or Q.
For the F test, for the eight designs with 60 or more observations these relative
errors range from 26% to 47% with a median of 35%. The corresponding minimum and
maximum values and median for Q are 30%, 60% and 45%.
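
The arithmetic behind the 20% criterion and the rule of 60 total observations can be sketched as follows. The function and variable names are illustrative assumptions; the list of designs is the fifteen treatment by block combinations retained in the study.

```python
def relative_error(p_hat, alpha):
    """Relative error of an empirical test size against the nominal level."""
    return (p_hat - alpha) / alpha


# The fifteen designs (J treatments, I blocks): all combinations except 10 x 30.
designs = [(J, I) for J in (2, 3, 6, 10) for I in (5, 10, 20, 30)
           if (J, I) != (10, 30)]

# Designs meeting the rule of 60 or more total observations.
large_designs = [(J, I) for J, I in designs if J * I >= 60]

# A 20% tolerance at the 5% level corresponds to sizes between .04 and .06.
lower = relative_error(0.04, 0.05)
upper = relative_error(0.06, 0.05)
```

Enumerating the designs this way gives eight with 60 or more observations, which agrees with the count used for the ranges and medians quoted above.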

Not much has been said about the multivariate test using the test statistic 
denoted as M. It should be clear that, on the basis of the summary results presented 
in this table, the multivariate test has little to recommend it. As the discussion
proceeds, certain characteristics of the multivariate test will be noted. 

Although these results were somewhat encouraging, this writer noted that sample
size recommendations based on these data would be misleading. The reason for this
is that certain values of the parameters used in this investigation lead to "effective"
sample sizes which are considerably less than the actual number of sample observations
generated. The point is that "non-informative" responses have no effect on the
computation of either Q or the sums of squares for the F statistic. In Cochran's
investigation of the small sample distribution of Q, he used eight different data sets,
usually with about 3 or 4 treatments and 10 blocks. Cochran used only "informative"
data in his investigation and found average errors of about 14% for .05 level tests.
Thus, with about 30 to 40 total observations, Cochran reported results which were
somewhat better than those obtained here with 60 or more total observations.

In an attempt to bring the results of Cochran and those summarized here into
closer agreement, this writer developed the notion of using "effective" sample size
($N_e$) as a criterion on which to categorize the 12 runs for each design. Effective
sample size in this context refers to the number of "informative" response vectors
generated. Since $N_e$ is a random variable for a given set of parameters $\rho$ and $\phi$,
the quantity which was selected to be used as a gross index of the number of
"informative" responses was the expected or average value of $N_e$, which will be denoted $E(N_e)$.

The computation of E(N_e) was carried out in the following manner. First, for 
a given configuration of ρ and φ, it is necessary to know the probability of a 
"non-informative" response (π_N). For some cases, this could have been done easily 
by hand. For example, for φ = .5 and ρ = .0, the probability of either a (0,0) or a (1,1) 
response in a two treatments design is .5² + .5² = .5. This calculation is 
straightforward since ρ = 0 which, for the normal distribution, implies independence of the 
continuous latent variables. It is readily seen that the two binary variables are 
also independent. For those cases when ρ was not zero, the probability of a 
"non-informative" response vector was estimated by generating 10,000 such samples on 
the computer. This method seemed to be sufficient considering the purposes for which 
this information was being obtained. 
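
The Monte Carlo step described above can be sketched as follows. This is an illustration under stated assumptions rather than the program used in the study: the latent normal variables are given a common correlation ρ ≥ 0 through a shared component, each is dichotomized so that the success probability is φ, and the function name, seed, and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def estimate_noninformative_prob(k, phi, rho, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the probability that all k binary responses agree.

    Assumed model: latent normals with common correlation rho >= 0, built as
    sqrt(rho) * shared + sqrt(1 - rho) * unique, then dichotomized so that
    each treatment has success probability phi (0 < phi < 1).
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - phi)   # latent value above which a success is scored
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    count = 0
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)
        successes = sum(a * shared + b * rng.gauss(0.0, 1.0) > cutoff for _ in range(k))
        if successes == 0 or successes == k:   # all failures or all successes
            count += 1
    return count / n_samples


# Two treatments, phi = .5, rho = 0: the estimate should be close to the exact value .5.
print(estimate_noninformative_prob(k=2, phi=0.5, rho=0.0))
# A larger correlation makes agreeing (non-informative) responses more frequent.
print(estimate_noninformative_prob(k=2, phi=0.5, rho=0.8))
```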
Using the value of π_N, the effective sample size should follow a binomial 
distribution with the "nominal" sample size and 1 − π_N as parameters. For example, for φ = .5, 
ρ = 0 and a "nominal" sample size of five, the distribution of N_e is: 

Effective Sample Size (N_e)    0      1      2      3      4      5 
Probability Pr(N_e)            1/32   5/32   10/32  10/32  5/32   1/32 

The expected value of N_e was then computed in the usual fashion as E(N_e) = Σ N_e Pr(N_e). 
In the example given this computation yields an expected sample size of 2.5. The 
careful reader will observe that, in some simulations, it is likely that all vectors 
will be non-informative. This situation leads to an indeterminate value for all 
three of the statistics. For the data on test size presented here, these data sets 
were taken as supporting the null hypothesis, and therefore, in the empirical size 
computations, 1000 is retained as the base. 
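
The worked example with a nominal sample size of five can be reproduced with exact fractions. The sketch below is illustrative only; the function name is hypothetical, and the probability of a non-informative response is taken as 1/2, the value for φ = .5 and ρ = 0 with two treatments.

```python
from fractions import Fraction
from math import comb


def effective_size_distribution(n_blocks, p_noninformative):
    """Pr(N_e = m) for m = 0..n_blocks, binomial with success probability 1 - p_noninformative."""
    p_informative = 1 - p_noninformative
    return [
        comb(n_blocks, m) * p_informative ** m * p_noninformative ** (n_blocks - m)
        for m in range(n_blocks + 1)
    ]


pmf = effective_size_distribution(5, Fraction(1, 2))
expected = sum(m * pr for m, pr in enumerate(pmf))   # E(N_e) = sum of N_e * Pr(N_e)
p_all_noninformative = pmf[0]                        # chance that a data set is indeterminate

print([str(pr) for pr in pmf])
print(expected, p_all_noninformative)
```

The last quantity, Pr(N_e = 0), is the chance that a simulated data set contains no informative vectors at all, which is the indeterminate case counted as supporting the null hypothesis.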

By comparing the results of individual runs to the E(N_e), this writer decided to 
isolate for further consideration those runs with E(N_e) greater than six. That these 
runs are well behaved is verified by Table 2, in which average empirical size results 
are given for those runs which satisfied the criterion. We note that the results 
are more in line with those presented by Cochran. In fact, the median relative error 
for Q for α = .05 is 14%, the figure which Cochran reported. Again, the phenomenon 
that F and Q have similar characteristics for α = .10 is verified. For smaller 
values of α, however, the median average test size and relative errors for F and Q 
become increasingly different. For the two smallest values of nominal α, the relative 
error of Q is smaller than that of F in only one comparison. 

For F the largest relative error for α = .025 is 22% and most of the errors for 
α = .01 are 30% or less. For the multivariate statistic we observe that for 3 treatments 
the procedure is fairly well behaved but that for 6 or more treatments, although 
AVE values are close to nominal α in some instances, percent errors are very large. 
This is due to a strange mixture of runs, most of which produce empirical type I error 
rates which either grossly exceed or fall well below the nominal values, but which produce 
averages which are reasonably close to those values. One of the reasons that the 
results for the 6 and 10 treatment designs with 20 or fewer blocks lead to conservative 
procedures is that there were many instances when M was indeterminate. These were 
counted, the reader will recall, as instances for which the null hypothesis would be 
accepted. 

An alternative method of determining whether there are any systematic tendencies 
for either of these procedures to be biased is to count the instances in which the 
empirical proportion is larger than the nominal α. This was done for the Q and F 
tests for each of the four values of α and the results are presented in Table 3. Since 
the multivariate test using M had exhibited such poor characteristics up to this 
point, results for it were not tabulated. An asterisk in this table signifies that, 
for more than half of the runs, empirical size exceeded nominal α. The columns 
headed "chance" in this table are simply one half the number of runs and indicate 
what would be expected if the nominal type I error values were the medians for the 
empirical proportions. The overall results indicate that, with the exception of the 
F test for α = .10, for both procedures the empirical proportions tend to be less 
than the nominal values. However, the results for F are much nearer to the chance 
results for all nominal α. When these data are exhibited by number of treatment 
levels, the general statements made earlier are verified. For 2 or 3 treatments, 
the main advantage of F over Q is for α less than .05; for 6 or more treatments, 
F exhibits less bias over the whole range of α under consideration. The slight 
tendency for empirical size to exceed nominal α for F for 3 treatments at α = .10 
does not appear serious. There is also a tendency for the advantage of F over Q 
to diminish, except for small α, for designs with 30 observations. Two summary 
statements which seem in order are that (1) the F test provides a better 
approximation to nominal α for small α (α less than .05) and (2) for designs with the 
treatment to block ratio large (e.g., 6x10, 6x20, 10x10 and 10x20) the F distribution 
also provides a better fit for the other nominal type I errors under consideration. 
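
The counting procedure behind Table 3 can be sketched in a few lines. The empirical sizes below are hypothetical values made up for illustration, not results from the study, and the function name is an assumption.

```python
def exceedance_summary(empirical_sizes, nominal_alpha):
    """Count runs whose empirical size exceeds nominal alpha and compare with chance."""
    exceed = sum(size > nominal_alpha for size in empirical_sizes)
    chance = len(empirical_sizes) / 2          # expected count if alpha were the median
    flag = "*" if exceed > chance else ""      # asterisk: more than half the runs exceed alpha
    return exceed, chance, flag


# Hypothetical empirical sizes for the 12 runs of one design at alpha = .05.
sizes = [.041, .052, .047, .038, .055, .049, .044, .036, .051, .043, .048, .040]
summary = exceedance_summary(sizes, .05)
print(summary)
```

A count well below the "chance" value, as in this made-up example, would indicate a conservative procedure; a count above it would earn the asterisk.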



Although results of individual runs will not be presented here because of the 
extensive number of tables which would be involved, a brief discussion of the major 
points of information which they provide will be given. They are: (1) The runs 
excluded by the expected sample size greater than six criterion were the runs with 
large ρ and/or small φ values. For all three procedures described here, these tests 
were conservative. (2) For some smaller designs such as the 2x10, 2x20, 3x10 and 
6x5, results using F will be reasonable if the data are not "pathological". That is, 
the fit of the variance ratio distribution appears to be adequate if the occurrence 
of a success is not very rare and at the same time the variables are highly related. 
(3) Results for the M statistic are very inconsistent for different combinations of 
ρ and φ. For mildly correlated or uncorrelated data, the M statistic exhibited a 
very serious tendency for empirical size to grossly exceed nominal type I error. 
This tendency increased with block size and was not diminished as the number of blocks 
increased. Probably the most serious instance of this was for the 6x30 design and 
the ρ = .0, φ = .1 run where the four empirical proportions were .226, .162, .101 and 
.049. The average relative error here is 273%. Although the writer has presented 
more complete information on the M statistic elsewhere (Mandeville, 1969), these results 
have not been presented here in the interest of space and also because of the deficiencies 
already noted in the procedure. 

Additional runs of test size were made for the three treatment and six treatment 
designs using non-patterned correlation matrices. These matrices were not chosen to 
be particularly exceptional, and, when taken in combination with the φ = .5 values, 
yielded ε values which were approximately .97 for three treatments and .95 for six 
treatments. The quantity ε, introduced by Box (1954a, 1954b), is a measure of 
deviation from pattern and ε = 1 for patterned matrices. Sample sizes of 10 and 20 
were used in combination with null success proportions of .1, .3 and .5 and these 
results are tabulated in Tables 4 and 5. Results of these runs are in reasonable 
agreement with those for constant correlation matrices. For example, for three 
treatments the average correlation in the non-patterned matrix was near .5 so that 
comparisons with results for the runs with constant correlation of .5 are suggested. For 
the F statistic and the 3x20 design, the average errors are 24%, 18% and 11% for 
φ = .1, .3 and .5, respectively. Although not tabled here, the corresponding values 
for a constant ρ of .5 are 35%, 16% and 21%. 

For the 6x10 design, however, the indication is that the non-patterned correlation 
structure leads to more conservative results for φ = .1 and .3 than the corresponding 
results for a patterned correlation matrix with ρ = .2. The average correlation in 
the six treatment correlation matrix is about .3. 
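
To make the role of Box's ε concrete, the sketch below computes it from a covariance matrix using the usual closed form in terms of the mean diagonal element, the row means, and the grand mean. Both matrices are hypothetical illustrations and are not the matrices used in the study; the function name is likewise an assumption.

```python
def box_epsilon(cov):
    """Box's epsilon for a k x k covariance matrix given as a list of lists."""
    k = len(cov)
    mean_diag = sum(cov[i][i] for i in range(k)) / k
    row_means = [sum(row) / k for row in cov]
    grand_mean = sum(row_means) / k
    sum_sq = sum(v * v for row in cov for v in row)
    numerator = k ** 2 * (mean_diag - grand_mean) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m * m for m in row_means) + k ** 2 * grand_mean ** 2
    )
    return numerator / denominator


# Patterned matrix: equal variances and a constant correlation of .5.
patterned = [[1.0, 0.5, 0.5],
             [0.5, 1.0, 0.5],
             [0.5, 0.5, 1.0]]

# Hypothetical non-patterned matrix with unequal correlations.
non_patterned = [[1.0, 0.7, 0.3],
                 [0.7, 1.0, 0.5],
                 [0.3, 0.5, 1.0]]

eps_patterned = box_epsilon(patterned)
eps_non_patterned = box_epsilon(non_patterned)
print(eps_patterned, eps_non_patterned)
```

The patterned matrix gives ε = 1 up to rounding, while the non-patterned one falls between the lower bound 1/(k − 1) and 1. Values near 1, such as the .97 and .95 reported above, indicate only a mild departure from pattern.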

Designs which were investigated with respect to empirical power included those 
studied as regards test size with the exception of designs with 5 and 30 blocks. Thus 
designs with 2, 3, 6 and 10 treatments were investigated for sample sizes of 10 and 
20 blocks. Sample size 5 was eliminated since the results on test size were generally 
negative for such small samples unless a large number of treatments was involved. It 
was also the feeling of the writer that elimination of samples of size 30 would not 
greatly reduce the implications of this phase of the study. 

Non-null vectors of proportions were obtained which exhibited linear departure 
about the central value φ_c = .5 and which would give theoretical (normal theory) power 
of .60 and .80 for tests run at the 5% and 1% levels. Although the normal theory 
assumptions were certainly not appropriate for the situation, this method was used 
so that, in the event that empirical power values were in agreement with the nominal 
values, it would be possible to recommend that a researcher use standard procedures 
for sample size computations. The constant correlation ρ was varied as before, taking 
the values of .0, .2, .5 and .8. To allow some generalization of the results, the 
combination φ_c = .3 and ρ = .2 was also included. For some sample sizes, no set of 
proportions between 0.0 and 1.0 could be found which satisfied certain size and power 
combinations. This only occurred for φ_c = .3, however. In addition, the non-patterned 
correlation matrices were used in conjunction with φ_c = .5. Examples of some non-null 
proportions vectors used are given in Table 6. The reader interested in the details 
of the method used to obtain the non-null proportions vectors is referred to Mandeville 
(1969). 
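
The reason some size and power combinations admit no vector for φ_c = .3 can be seen from a small sketch. It assumes that "linear departure" means equally spaced proportions centred on φ_c; the step sizes shown are hypothetical, since the spacing required for a given power comes from the method in Mandeville (1969), which is not reproduced here.

```python
def linear_proportions(k, center, step):
    """Equally spaced success proportions centred on `center`.

    Returns None when any proportion would fall outside the open interval (0, 1),
    i.e. when no admissible linear vector exists for this spacing.
    """
    props = [center + step * (j - (k - 1) / 2) for j in range(k)]
    if min(props) <= 0.0 or max(props) >= 1.0:
        return None
    return props


# Hypothetical spacings: the first is admissible, the second is not.
vector_ok = linear_proportions(k=3, center=0.5, step=0.1)
vector_bad = linear_proportions(k=6, center=0.3, step=0.15)
print(vector_ok, vector_bad)
```

With a central value of .3 there is less room below the centre than above it, so a spacing that is admissible around .5 can push the smallest proportion below zero.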

Because of the poor results obtained for the multivariate test, no power results 
will be presented here for the M statistic. Tables 7 through 10 contain power results 
for the Q and F tests. Dashes in these tables indicate that linear non-null 
proportions vectors do not exist for that combination of α, 1 − β and the other parameters. 

The results indicate that the F test is more powerful than Q for the designs with 
2 or 3 treatments if a 1% level of significance is used and for either the 5% or 1% 
level for designs with 6 or more treatments. Of course, these results parallel those 
found in the earlier part of the investigation and are, therefore, not surprising. 
However, it is also noted that the F test yields empirical power which is in good 
agreement with, although generally slightly less than, normal theory power. 

For the F statistic, the largest deviations of empirical from normal theory 
power results occurred for the two treatments designs. For both sample sizes, the 
largest average deviation occurred when normal theory power was .60 for 1% tests. These 
average empirical power values are .529 and .570 and represent deviations of .071 and 
.030 from the normal theory value of .60. 

The ranges of the observed empirical power values were similar for F and Q and 
decreased for larger numbers of treatments so that they were seldom larger than about 
.050 for 6 and 10 treatments. This result is also consistent with the facts brought 
out earlier that Q is testing the more general hypothesis which includes treatment x 
block interaction and that the denominator of Q may be slightly inflated by the mean 
square for treatments. This effect should be most readily observed when the treatment 
mean square is large and the number of blocks is small, i.e., for α = .01 and 10 blocks. 
Tables 7-10 verify that in these cases the advantage of F over Q is greatest. 
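
The remark about the denominator of Q can be made concrete by writing both statistics in terms of the randomized block sums of squares. For 0/1 data Q equals the treatment sum of squares divided by the within-block mean square, and the within-block sum of squares contains the treatment sum of squares itself. The sketch below uses hypothetical data and a hypothetical function name.

```python
def q_and_f(blocks):
    """Cochran's Q and the randomized block ANOVA F for 0/1 data (rows are blocks)."""
    n, k = len(blocks), len(blocks[0])
    col_totals = [sum(row[j] for row in blocks) for j in range(k)]
    row_totals = [sum(row) for row in blocks]
    total = sum(row_totals)
    correction = total ** 2 / (n * k)
    ss_treat = sum(c * c for c in col_totals) / n - correction
    ss_block = sum(r * r for r in row_totals) / k - correction
    ss_total = total - correction                  # x**2 == x for 0/1 data
    ss_error = ss_total - ss_block - ss_treat
    ss_within = ss_treat + ss_error                # includes the treatment sum of squares
    q = ss_treat / (ss_within / (n * (k - 1)))
    f = (ss_treat / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))
    return q, f


# Hypothetical data: 6 blocks, 3 treatments.
data = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
q, f = q_and_f(data)
print(q, f)
```

Because the two statistics share the same sums of squares, F = (n − 1)Q / (n(k − 1) − Q) for n blocks and k treatments. A large treatment sum of squares therefore enlarges the denominator of Q but not that of F.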



The limited attempt to generalize the results using φ_c = .3 and the non-patterned 
correlation matrices has produced results which are in reasonable agreement with those 
for φ_c = .5 and patterned correlation matrices. 

Summary and Conclusions 

Considering test size, designs with 60 or more total observations were found to 
lead to average relative error for F and Q of about 20% or less for 5% tests. Results 
for F and Q were similar for α = .05 for 2 or 3 treatment designs. For designs with 
6 or more treatments, the F test led to empirical size closer to nominal α than did 
the Q test. For α = .01 the F test outperformed Q but relative errors for both 
statistics were often as large as 40%. When only designs and parameter specifications 
for which the average effective sample size was 6 or more were considered, the results 
were in good agreement with those reported by Cochran. For these cases the advantage 
of the F test for α = .01 was again observed. When non-patterned correlation matrices 
were used in conjunction with small null φ values, there was a slight tendency for the 
resulting test procedures to be more conservative than those with patterned correlation 
matrices. 

As would be predicted from the results on test size, when power was considered, 
the F test proved to be more powerful than Q for α = .01 and for designs with 6 or 
more treatments this effect was observed for 5% tests also. The empirical power for 
F was generally slightly less than the nominal value but the results were close enough 
so that the use of standard parametric procedures to estimate sample size requirements 
seems justified. 

This research was begun in hopes of allowing a recommendation that ANOVA procedures 
be used instead of Q for dichotomous data in a randomized block design. In addition 
to these two procedures a multivariate test was also considered. In the comparisons 
that have been made, the F test has: 

1. Provided, for 5% tests, empirical size which has been as close to the nominal 
value as, or closer to it than, that obtained for either Q or M. 

2. Yielded empirical size closer to nominal size for α = .01 than has been 
obtained for either Q or M. 

3. Provided a maximum average percent error of about 20% for 5% tests when the 
total number of observations is 60 or more. 

4. Yielded a median average relative error of about 10% for α = .05 and 25% for 
α = .01 for designs with average effective sample size of 6 or more. 

5. Proved to be more powerful than Q for 1% tests for all designs considered. 

6. Proved to be as powerful or more powerful than Q for 5% tests. 

7. Yielded empirical power which was in good agreement with power predicted from 
normal theory calculations. 

On the basis of these results the F test is recommended over Q or M when all of 
the following conditions are met: 

1. The researcher is mainly concerned with comparing average treatment effects. 

2. Sixty or more total observations are available. 

3. The interrelationships between the variables may be assumed to be reasonably 
constant. 

4. The average success proportion is in the range .1 to .9. 

5. The data might reasonably be thought of in terms of a normal ogive or 
logistic scaling model. 

6. True type I errors may deviate by about 20% relative error for α = .05 and 
by 40% or less for α = .01 tests. By way of warning the reader should realize 
that, for either the F test or the Q test, certain large data sets can lead 
to results which deviate considerably from those obtained by the exact 
procedure. Thus, as pointed out by Tate and Brown, "When the true significance 
level is needed, it would seem necessary to construct the exact sampling 
distribution." (1964, p. 18) 

7. If power against linear non-null proportions vectors is of interest to the 
researcher, it is suggested that sample size computations based on normal 
theory considerations can be recommended. 

The writer feels that these rules are somewhat conservative but suggests that further 
work, possibly of an analytic nature, be done to determine the extent of the dependence 
of the results on points 3 and 5 above. It is hoped that work on generalizations to 
two treatment dimensions will also be forthcoming. 

Table 1 — Average empirical test size and relative percent error for all designs. Values 
have been averaged over the 12 runs for each design. 

Table 2 — Average empirical test size and relative percent error for designs and combinations of parameters 
which lead to expected sample sizes of 6 or more. 

Note (Table 3): An asterisk (*) signifies that, for more than half of the runs, empirical test size exceeded 
nominal α. 



Table 4 — Empirical upper tail probabilities for three test statistics 
for the 3x10 and 3x20 designs using a non-patterned correlation matrix. 

Table 5 — Empirical upper tail probabilities for three test statistics 
for the 6x10 and 6x20 designs using a non-patterned correlation matrix. 

Table 6 — Examples of non-null proportions vectors used in the investigation of 
empirical power. 



Table 7 — Empirical power for the Q and F tests for designs with 
2 treatments. Non-null proportions vectors would yield normal theory power 
for the F test of .60 and .80 at each of α = .05 and .01. 

Note: On this and succeeding tables the dash indicates that, 
due to the restrictions on the φ-values, no non-null vector exists. 

Table 8 — Empirical power for the Q and F tests for designs with 3 treatments. 
Non-null proportions vectors would yield normal theory power for the F test 
of .60 and .80 at each of α = .05 and .01. 

Table 9 — Empirical power for the Q and F tests for designs with 
6 treatments. Non-null proportions vectors would yield normal theory 
power for the F test of .60 and .80 at each of α = .05 and .01. 

Table 10 — Empirical power for the Q and F tests for designs with 
10 treatments. Non-null proportions vectors would yield normal theory 
power for the F test of .60 and .80 at each of α = .05 and .01. 